.\" Two macros for code display --
.\" Use fixed width font so indents show up right,
.\" decrease point size and vertical spacing by two
.de LS
.br
.DS B \\$1
.\" Need courier font for code display
.ft C
.ps -2
.vs -2
..
.de LE
.vs +2
.ps +2
.ft P
.DE
.br
..
.EQ
delim $$
.EN
.nr LL 6.5i
.nr PO 1.0i
.nr PI 0.2i
.nh
.nr PS 11
.KS
.sp 0.25i
.KE
.LP
.ce 1000
.LG
.LG
\fBThe Performance Implications of Thread Management Alternatives\fR
.sp 0.5
\fBfor Shared-Memory Multiprocessors\fR
.sp 1.5
.SM
Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy
.sp 0.5
.SM
Department of Computer Science
University of Washington
Seattle WA 98195
.sp 0.5
February 1989
.sp 2
.LG
\fBAbstract\fR
.SM
.ce 0
.sp 0.5
.PP
.nh
Threads ("lightweight" processes) have become a common element
of new languages and operating systems.
This paper examines the performance implications of several data
structure and algorithm alternatives for thread management in
shared-memory multiprocessors. Both experimental measurements and
analytical model projections are presented.
.PP
For applications with fine-grained parallelism, small differences in
thread management are shown to have
significant performance impact, often posing a tradeoff between
throughput and latency. Per-processor data structures can be used
to improve throughput, and in some circumstances to avoid locking,
improving latency as well.
.PP
The method used by processors to queue for locks is also shown
to affect performance significantly. Normal methods of critical
resource waiting
can substantially degrade performance with moderate
numbers of waiting processors. We present an Ethernet-style
backoff algorithm that largely eliminates this effect.
.sp
.LP
.nh
\fIIndex Terms\fR \- Multiprocessor, thread, locking, performance,
parallel software, parallel computing.
.FS
.sp 0.5
This material is based on work supported by
the National Science Foundation
(Grants No. CCR-8619663, CCR-8703049,
and CCR-8700106),
the Naval Ocean Systems Center,
U S WEST Advanced Technologies,
the Washington Technology Center,
and Digital Equipment Corporation (the Systems
Research Center and the External Research Program).
.sp
Authors' address: Department of Computer Science FR-35,
University of Washington, Seattle WA 98195; (206) 543-1695;
\fItom/lazowska/levy@cs.washington.edu\fR.
.FE
.sp 2
.LP
.NH
Introduction
.nr H1 1
.nr H2 0
.PP
.nh
The purpose of this paper is to study the performance implications of
thread management alternatives for shared-memory multiprocessors.
.PP
In traditional operating systems, a process, consisting of a single
address space and a single thread of control within that address space,
is used to execute a program. Within the process, program execution
entails initializing and
maintaining a great deal of state information.
For instance, page tables, swap images, file descriptors, outstanding
I/O requests, and saved register values are all kept on a per-program,
and thus per-process, basis. The sheer volume of this information
makes processes expensive to create and maintain.
.bp
.PP
Threads, or "lightweight" processes, separate the notion of
execution from the rest of the definition of a process. A single
thread executes a portion of a program, cooperating with other threads
concurrently executing within the same address space.
Like processes, every thread must have a separate program counter and stack
of activation records, describing the state of its execution.
However, much of what is normally kept on a per-process basis
can be maintained in common for all threads executing in a single program,
with dramatic reductions in overhead.
.PP
Thread packages have become a common element of new languages
and operating systems for both uniprocessor and multiprocessor
architectures. Mach [Accetta et al. 1986], Topaz [Thacker et al. 1988],
Psyche [Scott et al. 1988], DYNIX [Sequent 1988], and several extensions
to UNIX [Bach & Buroff 1984; Edler et al. 1988] are examples of
operating systems that provide explicit
support for concurrent or parallel execution of programs.
Ada [Mundie & Fisher 1985], CSP [Hoare 1978], Presto [Bershad et al. 1988a],
Mesa [Lampson & Redell 1980], Concurrent Euclid [Holt 1982], and
Emerald [Jul et al. 1988] evidence equal interest within the language
community.
.PP
On uniprocessors, threads are used as a program structuring aid
or to overlap I/O with processing. The metric of goodness for
these thread management implementations is simply processing cost per
thread creation or context switch. No locking is needed inside
thread routines, since only one routine can be executing at any one time.
.PP
Programs on multiprocessors use threads to exploit parallelism.
The speedup achievable by any given application
depends on the availability of thread management routines
that are both low in cost and free of serial bottlenecks.
In Sequent's DYNIX operating system, for example, applications
must use normal UNIX-like processes for parallelism [Sequent 1988].
Since process creation in DYNIX takes over 25 milliseconds, only very
coarse-grained parallelism can be exploited. As another example, the
Topaz kernel provides relatively inexpensive thread creation and
synchronization,
but the routines are protected by a single lock [Thacker et al. 1988].
While this
may be appropriate for architectures with small numbers of processors,
as the number of processors increases, the single lock could limit
speedups for applications with fine-grained parallelism.
.PP
Our initial experience in the area of high-performance thread packages
was with Presto, an application-level runtime library that
relies on the kernel only for processor allocation and memory
management [Bershad et al. 1988a]. This work showed that there is an
order of magnitude performance advantage to using threads instead of
DYNIX processes for exploiting parallelism. Drawing on this experience, we
implemented a thread package that is, in turn, another order of magnitude
faster than Presto. This basic package was then modified
to implement each alternative we wanted to explore.
.PP
One consequence of the speed of our basic thread package is that small
changes in the organization of data structures and locks have a significant
impact on performance. Often, the choice involves a tradeoff between
latency and throughput. However, per-processor data structures can sometimes
be used to avoid locking, improving latency and throughput
at the same time.
.PP
Another consequence of the speed of our thread package is that
its performance depends noticeably on the algorithm used to queue for locks.
Earlier, we studied the relative performance of spinning and blocking
locks [Zahorjan et al. 1988]. In general, a thread that tries to
acquire a lock that is already held can either spin ("busy-wait")
until the lock is released, or relinquish the processor.
However, within the thread management routines themselves, spinning
is the only option. Thus, blocking at the user level may require
spin-waiting in thread management routines.
Spin-waiting has a cost not only to the processor waiting
for the lock, but also to processors doing useful work. The
degradation of other processors becomes substantial for moderate
numbers of waiting processors, especially for small critical sections.
We present an Ethernet-style backoff algorithm that largely
eliminates this effect.
.PP
The following sections describe these issues in more detail.
In Section 2 we present an abstraction of a thread package: its
objects, resources, and operations. Section 3 outlines the
strategies for thread management that we examined and presents
measurements of their relative performance. Section 4
compares methods of queueing for locks. Section 5 combines these
results in an analytical model. Section 6 summarizes
our experiences.
.NH
An Abstract Thread Package
.PP
As noted in Section 1, threads gain efficiency by separating the
notion of execution from the rest of the definition of a process.
The data structures needed by each thread are a program
counter, a stack, and a control block. (The control block contains state
information needed for thread management. Through the control block,
the thread can be put onto lists and other threads can synchronize with
it.) Another important data structure is the ready queue, which lists
threads that are ready to run. Lampson and Redell
[1980] provide a good description of
the functionality of a uniprocessor thread package.
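.PP
To make the later discussion concrete, the sketch below shows one plausible
C layout for a thread control block and the ready queue. The declarations are
illustrative only; the type and field names are invented for exposition and
are not a transcription of any particular package.
.LS
/* Illustrative declarations; names are invented for exposition. */
typedef struct Thread {
    void          (*startFunc)(int); /* initial PC: where the thread begins */
    int           arg;               /* argument copied in at creation      */
    char          *stack;            /* base of the activation-record stack */
    char          *savedSP;          /* saved stack pointer when not running */
    struct Thread *next;             /* link for ready, condition, and free lists */
} Thread;

typedef struct {
    int    lock;                     /* spinlock protecting the queue */
    Thread *head, *tail;             /* FIFO of runnable threads      */
} ReadyQueue;
.LE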
.PP
Thread operations are shown in Table 2.1.
Creating a thread can be viewed as calling a procedure, except that
the callee can execute in parallel with the caller. In both cases,
the caller specifies a place to begin executing and some number of arguments.
In fact, thread creation and startup is semantically equivalent to
an asynchronous procedure call.
.KF
.TS
center;
l.
Thread Creation
Allocate and initialize a control block, saving the initial PC.
Allocate a stack and copy in the thread's arguments.
Place the new thread on the ready queue.
.sp .5
Thread Startup
Remove the thread from the ready queue and begin to execute it.
.sp .5
Thread Block (wait on a blocking lock, monitor condition variable, or message)
Save register values and PC on the thread's stack.
Place the thread on the condition queue for the event.
Look for a thread in the ready queue, and start or resume it.
.sp .5
Signal a Blocked Thread
Remove the thread from the condition queue.
Place the thread on the ready queue.
.sp .5
Thread Resume
Remove the thread from the ready queue.
Restore registers.
Continue executing it from the saved PC.
.sp .5
Thread Finish
Deallocate the stack and control block.
Look for a thread in the ready queue, and start or resume it.
.TE
.ce
\fBTable 2.1: Thread operations\fR
.KE
.PP
As Table 2.1 shows, a program can create a thread even if there is no
idle processor available to run it. Because the parallelism cannot
be immediately exploited in this case, it might seem that the overhead
of thread creation should be avoided. The program may run faster by
creating the thread, however, if at some future time there will be an
idle processor that can be used to execute the thread.
This idea of creating parallelism for future use is very powerful.
Unfortunately, in the above framework, its space cost is prohibitive.
Each thread must be initially allocated a large amount of space for
its stack, since it is expensive to dynamically expand the space if
the thread later runs out of it. In Table 2.1, the thread is allocated
space for a stack when it is created, but the space is largely wasted
until the thread is actually started. Using virtual memory could remove
the need to allocate physical memory to back the stack space until the
thread begins to run; however, allocating extra virtual memory is itself
expensive.
.PP
An important optimization to Table 2.1, therefore, is to copy a thread's
arguments into its control block when the thread is created. This way,
the stack need not be allocated until thread startup; the arguments
can be copied from the control block to the stack at that time.
WorkCrews [Vandevoorde & Roberts 1988] and Presto
[Bershad et al. 1988a] both take this approach.
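.PP
The sketch below illustrates this optimization; the routine names are again
invented, and the helper routines are assumed. The argument is saved in the
control block at creation, and the stack is allocated and filled in only at
startup.
.LS
/* Sketch of deferred stack allocation.  ThreadCreate runs on the creating
   processor; ThreadStartup runs on whichever processor dequeues the thread. */
void ThreadCreate(void (*func)(int), int arg)
{
    Thread *t = AllocControlBlock();   /* from a free list (assumed helper) */
    t->startFunc = func;
    t->arg = arg;                      /* argument kept in the control block, */
    t->stack = NULL;                   /* so no stack is needed yet           */
    ReadyEnqueue(t);
}

void ThreadStartup(Thread *t)
{
    t->stack = AllocStack();           /* allocated only when the thread runs */
    /* copy t->arg onto the new stack and begin executing t->startFunc */
}
.LE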
.PP
Another important optimization is to store deallocated control blocks and
stacks in free lists [Bershad et al. 1988a]. If these data structures were
individually allocated out of the heap, thread overhead would include
the cost of finding a free block of the correct size as well as
possibly coalescing the block when it is returned to the heap.
By using free lists, both allocation and deallocation can normally
be simple list operations.
.PP
We begin our study by assuming these optimizations. For simplicity, we will
focus on the effect of thread management alternatives on the performance
of only a few thread operations: creation, startup, and finish.
These operations manipulate each of the three shared data
structures: the ready queue,
the stack free list, and the control block free list.
Most of the discussion applies as well to the performance of block
and resume operations.
.NH
Thread Management Alternatives
.PP
In a parallel environment, access to shared data structures must be
serialized to ensure consistency and correctness. Our thread package uses
spinlocks for this purpose: when a processor tries to modify a data
structure, it must first lock it to obtain exclusive access; if
some other processor already holds the lock, the processor loops until
the lock is released.
.PP
Locking implies dual concerns of latency and throughput
[Kumar & Gonsalves 1977]. Latency
is the cost of thread management under the best case assumption
of no contention for locks. Throughput, on the other hand, is the
rate at which threads can be created, started, and finished when there
is contention. If part of
thread management must be done serially, then no matter how many processors
work on a problem, there will be some maximum rate at which thread operations
can be performed.
.PP
There are several ways of defining latency, with different
implications for different types of applications. If an application
keeps all of its processors continually busy, for instance by creating
threads before they are needed, then any time spent in creating,
starting, or finishing a thread is time that could have been spent doing
other useful work. When a thread finishes, however, if there is no other work
for the processor to do, the time spent deallocating the thread's data
structures is unimportant. Instead, the relevant issues include
how much a creating processor is delayed, since it has a thread to
run, and how much time it takes for the created thread to begin
running on a processor.
.PP
In the following subsections, we define five alternative thread
management strategies, and describe some of the potential advantages
and disadvantages
of each approach. We then provide measurement and analytical
comparisons of these alternatives.
.NH 2
Single lock: central data structures protected by a single lock
.PP
The most obvious approach to thread management is to protect all
data structures under a single lock. Once the lock is acquired
by a processor, the processor is assured that it can modify any stored state.
To perform a thread operation, a processor must first acquire the lock,
then do what is needed to the shared data, and finally release the lock when
done. In this way, only a single lock is needed per thread operation,
but, since most of
the thread management path is serialized, throughput is limited.
In the typical scheme, idle processors loop checking the ready queue for
work to do,
causing useless contention for the ready queue lock; however, this can
be avoided if idle processors check that the ready queue is not empty before
acquiring the lock.
(Ni and Wu [1985] present a different approach.)
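.PP
In outline, every operation brackets its work with the same global lock, and
idle processors test the queue before locking it. The following sketch is
illustrative rather than our exact code; the helper routines are assumed.
.LS
/* Single global lock protecting the ready queue and both free lists (sketch). */
int threadLock = 0;

void ThreadCreate(void (*func)(int), int arg)
{
    SpinAcquire(&threadLock);           /* one lock per thread operation */
    Thread *t = ControlBlockPop();
    t->startFunc = func;
    t->arg = arg;
    ReadyEnqueue(t);
    SpinRelease(&threadLock);
}

Thread *FindWork()                      /* executed by idle processors */
{
    for (;;) {
        if (!ReadyQueueEmpty()) {       /* unlocked check: avoids useless */
                                        /* contention for the single lock */
            SpinAcquire(&threadLock);
            Thread *t = ReadyDequeue(); /* re-check under the lock */
            SpinRelease(&threadLock);
            if (t != NULL)
                return t;
        }
    }
}
.LE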
.NH 2
Multiple locks: central data structures each protected by a separate lock
.PP
A somewhat more modular approach to locking is to separately
protect each data structure with its own lock [Lampson & Redell 1980].
Each operation on the data structure can then be surrounded by
a lock acquisition and release. For thread management, this involves
separately locking each enqueue and dequeue operation on the ready
queue, stack free list, and control block free list, the three shared data
structures.
.PP
There is a basic tradeoff between latency and throughput in the
choice between using a single lock or multiple locks in protecting
shared data structures [Kumar & Gonsalves 1977]. Since less of the total
thread activity is in a critical section, and since it is split among
several locks, the maximum rate of thread operations is higher with multiple
locks than
with a single lock. There is a cost to this increased throughput,
however: more lock accesses are needed, increasing latency.
.NH 2
Local freelist: per-processor free lists without locks; a central
locked ready queue
.PP
One way of avoiding locking is to maintain as much state as possible
locally, with each processor. If each processor maintains its own
free lists of control blocks and stacks, these structures need not
be locked, since only one processor will ever access them. As
before, there is a single shared ready queue whose accesses are locked.
.PP
The tradeoff between latency and throughput can be largely avoided
by using local free lists. Since fewer lock acquisitions are needed
per thread, latency is lower than with multiple locks, yet since only
accesses to the ready queue are serialized, throughput is better.
.PP
Local free lists need to be balanced. Control blocks
and stacks can migrate between free lists if the thread is created or
started on one processor and finished on another. This can happen, for
instance, if
a thread blocks and is resumed on a different processor. Thus, one free
list can be empty, requiring the processor to obtain more space from
the heap, while another free list has many entries. In the worst case,
some processors only create and start threads (allocate structures),
while other processors only finish them (deallocate structures).
Without balancing, the deallocated structures are never re-used;
a separate stack and control block are needed for every thread.
In contrast, with a centralized free list, only as many are needed
as there are active (created or started, but not finished) threads.
.PP
It is inexpensive, however, to balance free lists by using a
global pool and a threshold $T$ on the maximum size of each list. When
the size of a free list reaches the threshold, half the list can be
returned to the global pool; when a free list empties, $T / 2$
entries can be claimed from the pool. The global pool must be locked,
of course. For efficiency, it can be organized as a list of lists. The
processing cost of balancing is thus one locked pool access amortized
across at least $T / 2$ free list accesses.
Let $P$ be the number of processors. An application using balanced
local free lists will use no more than $O(P times T)$ more space
than one using a central list, even if one processor only allocates
structures that other processors deallocate. The worst case occurs when the
allocating processor's free list and the global pool are empty even
though the other local free lists are almost full.
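.PP
The following sketch shows one way to code the balancing scheme for control
blocks. The pool helper routines are assumed, and the constant $T$ is the
threshold discussed above.
.LS
/* Per-processor free lists of control blocks, balanced through a locked
   global pool organized as a list of lists (sketch; helpers assumed). */
Thread *localFree[MAXPROCS];
int     localCount[MAXPROCS];
ListOfLists globalPool;
int     poolLock = 0;

void FreeControlBlock(int me, Thread *t)
{
    t->next = localFree[me];
    localFree[me] = t;
    if (++localCount[me] == T) {            /* local list has reached T:  */
        Thread *half = SplitOff(&localFree[me], T / 2);
        localCount[me] = T / 2;
        SpinAcquire(&poolLock);             /* return half to the pool    */
        PoolPush(&globalPool, half);
        SpinRelease(&poolLock);
    }
}

Thread *AllocControlBlock(int me)
{
    if (localCount[me] == 0) {              /* local list empty:          */
        SpinAcquire(&poolLock);             /* claim T/2 entries (or fall */
        localFree[me] = PoolPop(&globalPool);   /* back to the heap)      */
        SpinRelease(&poolLock);
        localCount[me] = T / 2;
    }
    Thread *t = localFree[me];
    localFree[me] = t->next;
    localCount[me]--;
    return t;
}
.LE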
.PP
Thus, local free lists trade space for time. This tradeoff
is practical for control blocks. Utilization
of the pool lock is at most $O( P times {R over T} )$, where $R$ is the rate of
thread creation on a single processor. To ensure that the pool lock
is not a source of contention (which would inflate the overhead per
free list access), we can set the threshold $T$ to be equal to $P$.
Control blocks are relatively small objects (in our implementation,
roughly $100$ bytes); provided $P$ is not excessively large, using
$100P$ bytes per processor is not onerous. If $P$ is large,
then a tree of pools could be used to limit the space cost of balancing to
$O(P / log~P)$ bytes per processor.
.PP
The tradeoff is not practical for stacks, however.
Stacks are at least two orders of magnitude larger than
control blocks. Even if sufficient memory were available, using
that memory entails processing costs for initializing page tables
and increased cache miss rates that could easily overwhelm the
advantage gained from decreased locking. Instead, we let the local
stack free lists contain at most one element.
In this way, stacks need be allocated
from the global pool only when a processor blocks a thread and then
starts up a different thread, and deallocated only when a processor
finishes a thread and then resumes another thread.
.NH 2
Idle queue: a central queue of idle processors; per-processor free lists
.PP
None of the algorithms described so far exploit parallelism
in creating threads. The creating processor allocates and initializes
the control block; only when it is done is the starting processor allowed
to allocate and initialize the stack.
The cost of thread creation could be reduced if some of the work
was done by idle processors in parallel with the creating processor.
.PP
In addition to a central queue of threads, we can maintain a
central queue of idle processors. When there is a backlog of ready
threads, there is no point to attempting parallel thread creation since
all processors are already doing useful work. When a processor becomes
idle and there is no backlog, it pre-allocates a control block and stack,
puts itself on the idle queue, and spins on a local flag waiting for work.
Thread creation then dequeues the idle processor, initializes the
pre-allocated control block and stack, and sets that processor's flag,
indicating that it now has a thread that is ready to run. Instead of
processors searching for work, work searches for processors.
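.PP
A sketch of this strategy appears below. The names are illustrative and the
helper routines are assumed; the point is that the idle processor does its
allocation and queueing before a thread exists for it to run, and then spins
only on a flag in its own cache.
.LS
/* Sketch of the idle queue.  Each idle processor pre-allocates a control
   block and stack, enqueues itself, and spins on its own flag. */
typedef struct IdleProc {
    Thread          *preallocated;  /* control block and stack, ready to fill in */
    volatile int     hasWork;       /* set by the creating processor */
    struct IdleProc *next;
} IdleProc;

void IdleLoop(IdleProc *me)
{
    me->preallocated = AllocThreadAndStack();   /* done before the critical path */
    me->hasWork = 0;
    IdleEnqueue(me);                            /* locked central idle queue */
    while (!me->hasWork)
        ;                                       /* spin locally, in the cache */
    Run(me->preallocated);
}

void ThreadCreate(void (*func)(int), int arg)
{
    IdleProc *p = IdleDequeue();                /* NULL if no processor is idle */
    if (p != NULL) {
        p->preallocated->startFunc = func;
        p->preallocated->arg = arg;
        p->hasWork = 1;                         /* hand the thread directly to p */
    } else {
        EnqueueOnReadyQueue(func, arg);         /* fall back to the usual path */
    }
}
.LE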
.PP
In fact, this approach does not alter the essentially sequential
nature of thread creation. The idle processor must first queue itself
before the creating processor can dequeue it, which in turn must
set the flag before the idle processor can start running the thread.
The critical path between the beginning of thread creation and when
the thread starts running is reduced by doing some of the work
(allocating structures, acquiring a lock, enqueueing)
before the critical path begins.
Since this adds complexity, and there is no benefit in the absence of idle
processors,
the effect is to trade off reduced latency when there are idle processors for
increased latency when all processors are busy. Maximum throughput should be
unchanged since two locked queue operations are still needed per thread
life cycle. Wagner et al. [1989] describe a different way of using
idle processors to avoid work during blocking and resuming.
.NH 2
Local readyq: per-processor ready queues; per-processor free lists
.PP
Once free lists are made local, the ready or idle queue lock can become a
serial bottleneck as the number of processors or the rate each processor
schedules work increases [Dritz & Boyle 1987].
One way of increasing throughput is to
divide the load on a single lock among several locks. An application
of this idea is to keep a ready queue per processor. In this way,
enqueueing and dequeueing threads can occur in parallel, with
each processor using a different queue. There is again a tradeoff
between latency and throughput in the choice between using one or more
ready queues.
.PP
Unlike the case of control block free lists, unlocked local ready queues
are inefficient even if balanced through a global pool.
Runnable threads are a scarce resource.
An idle processor might have an empty queue, yet a ready thread that
the processor could run is in some other processor's queue, while the global
pool is empty. Performance can be arbitrarily bad
in any scheme where a processor can be idle indefinitely while there is even
one ready thread in some other queue. For instance, suppose $P$ identical
threads are created, but due to an imbalance, only $P~-~1$ are started
while one processor idles. The runtime would then be twice as long
as with any of the centralized queueing strategies.
.PP
One simple way of avoiding indefinite idling is to lock each ready queue;
each idle processor can then scan the ready queues for work,
beginning with its own [Dritz & Boyle 1987].
If there is a ready thread, an idle processor will
eventually find it. Processors can queue created threads locally, since
balancing is achieved by idle processors. However, contention can still
occur if a single processor enqueues (during create or signal) every
ready thread; that processor's queue
would operate as if it was a central ready queue, except that
idle processors would have to waste time scanning for it. A simple
way of avoiding this situation is to randomly choose a queue for
each newly ready thread.
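.PP
The sketch below illustrates this strategy: threads are enqueued on a randomly
chosen queue, and an idle processor scans all queues beginning with its own.
The helper routines are assumed.
.LS
/* Per-processor ready queues (sketch): enqueue on a random queue, and have
   idle processors scan every queue beginning with their own. */
ReadyQueue readyQ[MAXPROCS];

void ReadyEnqueue(Thread *t)
{
    int q = Random() % numberOfProcessors;      /* avoid overloading one queue */
    SpinAcquire(&readyQ[q].lock);
    Enqueue(&readyQ[q], t);
    SpinRelease(&readyQ[q].lock);
}

Thread *FindWork(int me)
{
    for (;;) {
        for (int i = 0; i < numberOfProcessors; i++) {
            int q = (me + i) % numberOfProcessors;   /* start with own queue */
            if (!QueueEmpty(&readyQ[q])) {
                SpinAcquire(&readyQ[q].lock);
                Thread *t = Dequeue(&readyQ[q]);     /* may be NULL: re-check */
                SpinRelease(&readyQ[q].lock);
                if (t != NULL)
                    return t;
            }
        }
    }
}
.LE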
.PP
If each queue is equally likely to get a new ready thread, latency is
bad when the number of runnable threads is near to the number of processors.
There are two cases. Consider the cost of scheduling a thread onto
a newly idle processor. If there are no ready threads, there is effectively
no cost until a new thread is created. If there are ready but not running
threads, any time spent finding a thread to run could have been spent
running that thread. This time is small when there are many ready
threads, because the idle processor will find the thread after scanning
only a few queues; however, when there
is only a single ready but not yet running thread, the processor will
have to examine on average half of the queues in order to find it.
The cost of scheduling a newly created thread onto an idle processor
is similar: the thread will be found quickly if there are many
idle processors and more slowly if there are only a few.
.PP
One reason to have a one-to-one correspondence between processors
and ready queues is to maintain locality. There is an application-specific
cost to migrating a newly resumed thread from the processor it last ran
on, due to increased cache misses. The ability to avoid migration
depends on the backlog of ready threads [Eager et al. 1986].
.PP
If maintaining locality is not important, then there is a tradeoff
between latency and
throughput in choosing the number of queues [Ni & Wu 1985].
Up to some maximum, throughput is higher with more queues, but
the number of queues that must
be scanned to find work, and thus the latency, is also higher.
We set the number of queues equal to the number of processors for
all measurements.
.NH 2
Measurement results
.PP
To validate our intuitions about the relative merits of the alternative
approaches, we implemented each on a Sequent Symmetry Model A shared-memory
multiprocessor. All code was written in C and compiled with Sequent's
standard compiler, with the exception of the locking and context
switching code, which was programmed in assembler.
Our Symmetry has twenty Intel 80386 processors, a shared bus,
and a write-through cache coherency protocol [Lovett & Thakkar 1988].
The Symmetry has a timer with microsecond resolution that was used for
all measurements. Table 3.1 contains times for sample Symmetry operations.
.KF
.TS
center, box;
c c
l n.
Operation	Runtime (\(*msec.)
_
Acquire and release a lock	5.6
Procedure call with no arguments	3.6
Each 4-byte argument	1
Iteration of null loop	2.5
.TE
.ce
\fBTable 3.1: Runtimes for Symmetry operations (measured)\fR
.KE
.PP
For all measurements, free lists were "warm started": sufficient control
blocks and stacks were pre-allocated for use by the benchmark. Our
purpose was to measure the relative merits of each alternative, rather
than the efficiency of the underlying memory management system. The
cache was not warm-started, but we ran each benchmark long enough
for this effect to become insignificant.
.PP
Figure 3.1 is the principal performance comparison: it shows the elapsed
time in seconds for each thread management alternative to create,
start, and finish one million "null" threads, for varying numbers of
processors. The one processor case shows the latency for a single thread in
microseconds when there is no contention for locks.
.PP
Table 3.2 lists the code for this test.
Initially, $P$ threads are created; each recursively creates
a thread then finishes, allowing that processor to start up one of the
waiting threads. The test terminates when each processor has executed
1 million / $P$ threads; in practice, we found that the processors all
completed at roughly the same time.
For the multiple ready queue alternative, each
newly created thread was added to a random queue to avoid biasing the results
with the effect of locality. This test is not intended to be representative
of a real parallel program, but it does expose the tradeoffs between the five
alternatives. ("numberOfThreads" is a location private to each processor,
initially set to 0.)
.KF
.ft C
.ps -1
.vs -1
.TS
center;
l.
ThreadCycle () {
    numberOfThreads = numberOfThreads + 1;
    if (numberOfThreads == 1000000 / numberOfProcessors)
        Synchronize(); /* wait for all processors to reach here */
    else
        ThreadCreate(ThreadCycle);
}
BeginTimer();
for (i = 1; i < numberOfProcessors; i++)
    ThreadCreate(ThreadCycle);
ThreadCycle();
StopTimer();
.TE
.ps +1
.vs +1
.ft P
.ce
\fBTable 3.2: Benchmark: create, start, and finish 1 million null threads\fR
.KE
.PP
Figure 3.2 shows the inverse graph for the same test: speedup as a
function of the number of processors. We define speedup to be the
ratio of the time for the fastest alternative on one processor
(single lock) to the time each alternative takes on $P$ processors.
Speedup is proportional to throughput; this graph shows the maximum
number of thread creates, starts, and finishes that can be performed
in parallel for each alternative.
.KF
.sp 3.0i
.ce 2
\fBFigure 3.1: Principal results for thread management \- elapsed time
to create, start and finish 1,000,000 null threads (measured)\fR
.sp .5
.sp 3.0i
.ce
\fBFigure 3.2: Speedup to create, start and finish 1,000,000 null threads (measured)\fR
.sp .5
.KE
.PP
Before examining the relative performance of the five alternatives,
we note that each of them has quite good performance.
Threads are
only an order of magnitude more expensive than a procedure call, and
500 times less expensive than normal DYNIX processes.
Threads in Presto cost 600 microseconds on the
same Symmetry hardware, an order of magnitude worse than our threads
although an order of magnitude better than DYNIX processes.
.PP
While Presto's speedup relative to DYNIX is due to using threads instead
of processes, our speedup relative to Presto is due to attention to
implementation details. We implemented Presto in C++; while this enhanced
its ability to be modified [Bershad et al. 1988b], its C++ was first
pre-processed into C, then compiled. This resulted in much less efficient
code than could be achieved by direct coding in C. Another factor is that
we stripped thread control blocks of all non-essential state, reducing
the cost of initialization dramatically. We did not remove
functionality: our thread package
could be given Presto's user interface without
sacrificing its performance.
.PP
Because our threads are inexpensive, the choice of alternatives has a
large relative impact on both latency and throughput for applications with
fine-grained parallelism.
Specifically:
.IP \(bu
Adding even a single lock acquisition into the thread management
path can increase latency significantly. Locking each of the data
structures separately results in a much higher latency than
locking all data structures under the same lock. Using
per-processor data structures to avoid locking
is thus crucial to decreasing latency without sacrificing throughput.
.IP \(bu
Additional complexity results in a noticeable increase in latency.
There are on the order of 100 instructions in the thread management
path; adding even a few extra instructions impacts performance.
For example, the idle queue strategy checks for idle processors on
thread creation. If the idle queue is always empty, as in the measurements
in Figure 3.1, it defaults to a normal ready queue.
Even this simple check markedly increases the cost of threads.
This implies that thread management routines must be kept simple;
enhancements that would otherwise seem plausible but add complexity
are unlikely to work, since there is little computation
to save, and it is easy to swamp the savings with increased overhead.
.IP \(bu
A large portion of the thread management path is locked, since
little work is required beyond manipulation of shared data. When all data
is kept under a single lock, throughput is limited by
contention for this lock. However, even with local free lists,
the lock on the ready queue limits throughput to a few concurrent
thread operations. Only local ready queues can support
higher throughput.
.IP \(bu
When lock contention is not a problem, the bandwidth of the bus
limits thread management throughput. The curve in Figure 3.2 levels
out for the local ready queue alternative, even though there is
no significant contention for locks. While the high bus demand per
thread may be specific to the write-through cache protocol on the Symmetry,
bus contention is likely to be a problem on any bus-structured shared-memory
system.
.PP
In Figures 3.1 and 3.2, threads do no work except to create other threads.
It is natural to ask whether the performance implications of the thread
management alternatives would still be significant in the presence
of user-mode computing. We modified the test in Table 3.2 so that each
thread performs an average of 300 microseconds of user work, taken from a
uniform distribution. This simulates the behavior of an application
with fine-grained parallelism.
.PP
Figure 3.3 graphs speedup for this modified benchmark. Differences appear
as the number of processors increases.
.KF
.sp 3.0i
.ce
\fBFigure 3.3: Speedup, user work = 300 \(*msec. (measured)\fR
.sp .5
.KE
.PP
Figure 3.4 graphs thread cost in microseconds as a function of the number of
runnable threads (parallelism).
Thread cost was directly measured by taking timestamps
before and after each thread was created and whenever a thread was started
or finished. Multiple creations and completions were measured and averaged to
improve accuracy; they were synchronized to avoid measuring lock contention.
.KF
.sp 3.0i
.ce
\fBFigure 3.4: Thread latency (\(*msec.) vs. number of runnable threads, 18 processors (measured)\fR
.sp .5
.KE
.PP
There are a few things to note in Figure 3.4:
.IP \(bu
The curve for the central ready queue alternative jumps
when the number of runnable threads reaches the number of processors
because of a change in the definition of thread latency.
When there are fewer threads than processors, thread cost is taken to be
the time to create and start running a new thread. The time to finish
a thread is unimportant if the idling processor has no work to do.
When there are as many or more runnable threads as processors, the cost is
the sum of the time
to create a thread plus the time to finish it and start a new thread.
The thread latency reported in Figure 3.1 with
one processor corresponds closely to the latency reported in Figure 3.4
when there are more runnable threads than processors.
.IP \(bu
Using an idle queue is faster when there are idle processors,
but slower when there are more runnable threads
than processors. Thread creation is faster if an idle processor can be
used to do work before the thread is created, but checking the idle
queue incurs overhead even if it is not used. Whether a particular
application will run faster with an idle queue depends on how much
time it spends in each case.
.IP \(bu
The spike in the curve for per-processor ready queues shows that
finding a ready thread among many queues is
expensive when the parallelism of the application is near to the
number of processors, but the expense fades when more
ready threads or more idle processors are available. Since the
height of the spike grows linearly with the number of queues,
latency increases in proportion to the maximum throughput.
However, the amount of user computing per thread needed to avoid
contention for a central ready queue also grows linearly with the number
of processors. Thus, the cost of searching for work among per-processor
ready queues as a fraction of the time to do that work is not large unless
the multiple ready queues are needed for additional throughput.
.PP
One area of further research is to examine hybrid thread management
strategies to combine the advantages of some of the alternatives we
have presented. For example, a per-processor version of the central
idle queue could exploit locality and parallelize thread creation
without compromising throughput. As another example,
both central and per-processor ready queues could be used, by
placing threads in a local queue if the lock on the central queue is
busy. The drawback to any such approach is that complexity adds cost
which may outweigh any benefits.
.NH 2
Analytical explanation of Figure 3.4
.PP
We now derive a formula that explains in detail the spike
for the per-processor ready queue alternative in Figure 3.4.
When there are idle processors, we need to know the time between the
queueing of a ready thread and the dequeueing of that thread by an
idle processor; when there is a backlog of ready threads, we need to
know how long it takes a newly idle processor to find one of the threads.
.PP
Let $E(r,q)$ be the expected number of
queues examined by a newly idle processor to find one of $r$ ready threads,
which are randomly distributed among $q$ queues. Without loss of
generality, let the queues be numbered from $1$ to $q$, let threads
be numbered from $1$ to $r$, let $i sub j$ be the queue containing the
$j$th thread, and let the idle processor
begin searching with queue 1. The idle processor must examine the number
of queues equal to the lowest numbered non-empty queue.
The number of ways of putting $r$
threads into $q$ queues is $q sup r$.
.br
.EQ I
E(r,q)~=~
{1 over q sup r}~
sum from {i sub {1}~=~1} to {q}
~{sum from {i sub {2}~=~1} to {q}
~{... sum from {i sub {r}~=~1} to {q}
{~minimum~of~(i sub {1},~i sub {2}, ... i sub {r})}}}
.EN
.br
We can separately sum when each $i sub j$ is the minimum. When
more than one thread is at the minimum, we count the value once
in the sum for the least numbered thread. Thus, the value of $i sub j$
is counted whenever for all $k < j$, $i sub k > i sub j$, and for
all $k > j$, $i sub k >= i sub j$. There are ${(q - {i sub j})} sup {j-1}$
values for the $i sub k$, $k < j$ that satisfy the first condition
and ${(q - {i sub j} + 1)} sup {r-j}$
values for the $i sub k$, $k > j$ that satisfy the second.
.br
.EQ I (3.1)
E(r,q)~=~
{1 over q sup r}~
left (~q~+~sum from {j = 1} to {r} ~{sum from {i = 1} to {q-1}
{i {(q - i)} sup {j-1} {(q - i + 1)} sup {r-j}}} right )
.EN
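.PP
Equation 3.1 can be evaluated directly. The routine below is included only as
an illustration of the computation; it sums the formula in floating point.
(As a check, it gives $E(1,q)~=~(q+1)/2$, as expected when a single thread is
equally likely to be in any queue.)
.LS
#include <math.h>

/* Evaluate Equation 3.1: the expected number of queues examined to find
   one of r ready threads distributed uniformly among q queues. */
double ExpectedQueuesExamined(int r, int q)
{
    double sum = q;                            /* the leading q term */
    for (int j = 1; j <= r; j++)
        for (int i = 1; i <= q - 1; i++)
            sum += i * pow(q - i, j - 1) * pow(q - i + 1, r - j);
    return sum / pow(q, r);
}
.LE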
.PP
By symmetry, Equation 3.1 also holds when there are more processors
than runnable threads. Let $r$ be the number of idle processors,
let $i sub j$ be the queue currently scanned by the $j$th idle
processor, and let the newly created
thread be put into queue $1$. Then the processor that actually dequeues
the thread will have to look through $E(r,q)$ queues, after the
thread is queued, in order to find it.
.PP
Figure 3.5 graphs Equation 3.1 for 18 processors. In order to correspond
to Figure 3.4, the x-axis is the number of runnable threads, rather than
the number of ready but not running threads or the number of idle processors.
Since part of the spike in Figure 3.4 is due to the difference
in the measurements when there are idle processors or not, the
curves in Figures 3.4 and 3.5 correspond well.
.KF
.sp 3.0i
.ce
\fBFigure 3.5: Queues examined vs. number of runnable threads, 18 processors (Equation 3.1)\fR
.sp .5
.KE
.PP
The above analysis assumes that events occur one at a time. Since
finding a ready thread among a number of queues can take a non-trivial
amount of time, it is reasonable to consider what happens when another
thread is created or another processor becomes idle during the interim.
Suppose another thread is created before a newly idle processor finds one
of the $r$ ready threads. Let $C$ be the cost (in number of queues
examined) of finding a thread in this situation. $C$ can be no better than
if the new thread had been there all along and no worse than if the new
thread is ignored. In other words, $E(r+1,q)~<=~C~<=~E(r,q)$.
Similarly, if another processor becomes idle in the interim, provided
$r~>=~2$, the combined cost for both processors to find threads is
$E(r,q)~+~E(r-1,q)$, assuming the processors do not contend for the same
queue, independent of which processor finds a ready thread first.
.NH
Spinlock Management Alternatives
.PP
If a processor finds a thread management lock busy, it spin-waits
for the lock to be released. An alternative would be to block, relinquishing
the processor to do other useful work while the lock is busy.
However, since thread management locks are held for only a short time,
the overhead of performing this context switch would be prohibitive;
in addition, any other work that the processor might do is controlled
by a (or possibly the same) thread management lock.
When an application lock is busy, a thread does have a choice between
spin-waiting or blocking, but blocking at the user level may result
in spin-waiting in a thread routine.
.PP
Spin-waiting has a hidden cost. Processors doing useful work may be
slowed by processors that are merely waiting for a lock, due to bus
contention. As a result,
adding to the number of processors executing an application may
in fact slow it down by increasing the average number of spinning
processors. Worse, the more spinning processors, the more the
processor holding the lock is slowed, increasing the effective
size of the critical section, resulting in even more waiting processors.
.PP
In this section, we evaluate three different approaches to spin-waiting.
.NH 2
Hardware description
.PP
On the Symmetry Model A, each processor has its own cache; provided
all of its memory references can be satisfied out of that cache,
a processor's progress is independent of the activity of other processors.
Whenever a processor reads data that is not in its cache, it must wait for
the data to come from memory via the bus; with a write-through protocol,
a processor may also have to wait for writes to be sent to memory.
In both cases, the wait can be longer and the processor's progress slowed
because of bus contention.
.PP
The Symmetry has a basic test-and-set instruction, xchgb (exchange byte),
that atomically reads a memory location and writes in a new value.
The atomicity of the xchgb operation is enforced by the bus: a copy of
the memory location is brought into the processor's cache, modified there,
and then written back to memory. As the value is written to memory,
all copies of the old memory value in other caches are invalidated.
No comparison of the old and new values is performed; memory is written
and other copies invalidated even if the value is unchanged.
Any requests for
that memory location in the interim are delayed until the processor
is done modifying it [Lovett & Thakkar 1988].
.PP
The Sequent locking protocol is as follows: The lock is held if the value
is a 1 and free if 0. To lock, a processor exchanges
in a 1. If the old value was a 0, it got the
lock; if the value was a 1, the lock was already held by someone else,
and the processor must try again. In either case, the value
is 1 afterwards. The lock is released by exchanging in a 0; this
allows some other processor to get a 0 back in exchange for a 1.
There are several potential protocols for spin-waiting, which are
described below.
.NH 2
Spin on xchgb
.PP
Perhaps the simplest way to implement spin-waiting is for each processor
to loop on the xchgb instruction until it succeeds. The drawback
to this approach is that every xchgb instruction consumes bus
resources, whether or not it succeeds.
As additional processors spin on the lock, the holder
of the lock is slowed both because the bus is busier and because to
free the lock it must contend for permission to update the lock
value with the xchgb's of processors uselessly trying to acquire the lock.
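.PP
A sketch of this alternative follows; the AtomicExchange routine stands in for
the assembler xchgb sequence and is an assumption of the sketch.
.LS
/* Spin on xchgb (sketch).  AtomicExchange atomically stores the new value
   and returns the old one, as the xchgb instruction does. */
extern int AtomicExchange(volatile int *addr, int newValue);

void XchgbAcquire(volatile int *lock)
{
    while (AtomicExchange(lock, 1) == 1)
        ;                        /* every attempt consumes bus cycles */
}

void XchgbRelease(volatile int *lock)
{
    AtomicExchange(lock, 0);     /* release by exchanging in a 0 */
}
.LE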
.NH 2
Spin on read
.PP
Coherent caches seem to allow processors to spin without using bus cycles.
A processor can try to acquire the lock once; if this fails, the
processor can spin reading the lock memory location. As long as the
value is 1, the lock is still held. This spinning is done in
the cache, avoiding bus traffic. When the lock
is released, the cache copy will be invalidated; the spinning processor
will see the value change to 0 and can then try to acquire the lock
using an xchgb operation.
Sequent's runtime library uses this implementation [Sequent 1988].
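.PP
One way to code this alternative is sketched below; as before, AtomicExchange
stands in for the xchgb instruction.
.LS
/* Spin on read (sketch): spin on an ordinary load, which is satisfied from
   the cache while the lock is held, and try xchgb only when it looks free. */
extern int AtomicExchange(volatile int *addr, int newValue);

void ReadSpinAcquire(volatile int *lock)
{
    for (;;) {
        while (*lock != 0)
            ;                               /* spin in the cache */
        if (AtomicExchange(lock, 1) == 0)   /* lock looked free: try it */
            return;                         /* another processor may win, */
    }                                       /* in which case spin again   */
}
.LE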
.PP
A problem arises when there are a number of processors waiting for a
small critical section. When the lock is freed, every spinning
processor's cache copy is invalidated, causing each processor to fetch
the new value in turn.
The first to try to acquire the lock succeeds; however, each processor
that sees the value as 0 before this occurs will also, in turn, try to
acquire the lock, fail, and go back to looping on a read. Unfortunately, each
processor that does an unsuccessful xchgb operation invalidates all
cache copies, forcing all processors that were looping to miss again.
Thus, after each such operation, almost every spinning processor contends
for the bus, some still waiting to do an xchgb and the rest to fetch
the lock value. Eventually, each processor sees that the lock has been
acquired and quiesces, looping in its cache.
.PP
For a given number of spinning processors,
the performance of this algorithm is better for longer critical sections.
After the lock is released and before quiescence,
each spinning processor spends most of its time with a pending bus request;
any normal bus request during this time will be correspondingly delayed.
After quiescence, the spinning processors
place no load on the bus, allowing the processor
holding the lock to progress unhindered. With longer critical
sections, the initial degradation is less significant.
By contrast, spinning on the xchgb instruction degrades bus performance
evenly throughout the critical section.
.NH 2
Ethernet-style backoff
.PP
The source of the difficulty is that there is a cost to attempting
to acquire the lock. A generic solution to problems
of this sort is to have each processor estimate its likelihood
of success, and only try the lock when the probability is high. The
estimate can be made from experience. The more times a processor
has tried and failed, the more likely it is that many processors
are spinning for the lock. When the lock is released, then, instead
of every processor rushing to try to get it, each waits a period
of time dependent on the number of past failures. If the lock is
still free after this period, then the probability of success is
high enough to try the lock. We used this algorithm for our
measurements in Section 3.
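.PP
The sketch below shows the flavor of the algorithm. The exponential delay
function and its constants are assumptions made for illustration; the
essential point is that the delay grows with the number of failed attempts and
that the lock is tried only after it has remained free for the whole delay.
.LS
/* Backoff spin-waiting (sketch); delay constants are assumed. */
extern int AtomicExchange(volatile int *addr, int newValue);

#define UNIT_DELAY 10          /* assumed base delay, in loop iterations */
#define MAX_SHIFT  6           /* assumed cap on the backoff exponent    */

void BackoffAcquire(volatile int *lock)
{
    int failures = 0;
    for (;;) {
        while (*lock != 0)
            ;                                 /* spin in the cache while held */
        int delay = UNIT_DELAY << (failures < MAX_SHIFT ? failures : MAX_SHIFT);
        while (delay-- > 0)
            ;                                 /* back off in proportion to   */
                                              /* the number of past failures */
        if (*lock == 0 && AtomicExchange(lock, 1) == 0)
            return;                           /* still free after the delay  */
        failures++;                           /* collision: estimate more    */
    }                                         /* waiters, so wait longer     */
}
.LE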
.PP
The analogy with Ethernet is revealing. In the Ethernet protocol,
a processor can start a network transmission in any time slot that
the network is free [Metcalfe & Boggs 1976]. If two try to start
transmitting in the same slot, both fail and must be retried later.
To avoid further collisions,
the length of time before retrying depends on the number of
collisions encountered so far. In our case, when a number of
processors simultaneously try to acquire a lock, one will succeed,
but its progress will be slower than if there were no collisions.
.PP
The downside to Ethernet-style protocols is that they are unfair.
A processor that has just arrived is more likely to acquire the
lock (or network) than one who has been waiting, and failing, for
some time. Spinning on a test-and-set instruction and spinning
on a read of the lock location are both probabilistically fair; each
spinning processor has an equal likelihood of getting the lock,
even though the possibility of indefinite starvation exists.
Lock fairness is sometimes important to an application.
.PP
Another drawback of the backoff algorithm is that it takes longer for a
spinning processor to acquire a newly free lock. The processor must check
the lock value, delay, and check it again before trying the lock.
Once the lock is acquired, however, the processor will proceed faster,
relatively unimpaired by other spinning processors.
.PP
Even using this algorithm, there will be processor degradation when
there are large numbers of spinning processors. When the lock
is released, every spinning processor encounters a cache miss. After
this initial miss, most processors delay locally until some other processor
has acquired the lock, and then miss again to see that the lock has been
acquired. With enough spinning processors, the bus can be saturated
with these misses, slowing down the processor executing in the critical
section.
.PP
These cache misses can be avoided. A processor can delay whenever it
reads the lock value as busy. If the lock is not busy, the processor
can immediately try to acquire it. Thus, spinning processors miss
their cache every time the delay period expires,
rather than every time the lock is released. This is analogous to the
Ethernet notion of persistence [Metcalfe & Boggs 1976].
A result of this variation is
an even greater delay between when a lock is released and when a spinning
processor will acquire the lock.
.PP
Processors can spin-wait (degrading other processors) for things
other than locks. Agarwal and Cherian [1989] apply backoff
to spin-waiting for data to become available.
Spin-waiting can also be a problem with idle processors polling a central
or distributed ready queue.
When a ready thread is queued, if every idle processor rushes
to acquire the lock, bus saturation will result. Even if each idle processor
delays after observing that a thread is queued, then makes sure that it
is still queued, there is still a cache miss per idle processor,
hurting performance for large numbers of idle processors.
.PP
If idle processors are kept on a queue, this problem does not occur.
Each idle processor spins on a separate flag. When a thread becomes ready to
run, only one processor's flag is modified; every other processor continues
spinning without even a cache miss. The performance advantage of having
work look for processors instead of processors looking for work
will therefore be more important in systems with large numbers of processors.
This effect can be seen in Figure 3.4; the cost of the central ready queue
is higher when there are only a few runnable threads, since there are more
idle processors spin-waiting for work to appear in the ready queue.
.NH 2
Measurement results
.PP
Figure 4.1 shows the elapsed time for various numbers of processors
to cooperatively increment and test a shared counter
in a critical section 1 million times, for each method of spin-waiting.
Each processor executes a loop: wait for the lock, increment the counter,
and release the lock.
We do not claim that this test is representative
of the normal use of critical sections, but similar curves have been measured
with more significant computation between lock accesses.
Since there is little parallelism, if spinning
processors did not slow the processor holding the lock, the curve
would be flat.
.PP
The magnitude of this effect is striking. Both spinning on the
xchgb instruction and spinning on a read
degrade processor performance badly for even a moderate number of
spinning processors. For small critical sections, in either alternative,
every spinning processor spends all of its time doing cache read misses
or atomic xchgb operations, consuming bus cycles as fast as possible.
By contrast, the backoff algorithm results in only slight degradation
unless the number of spin-waiting processors exceeds ten.
.KF
.sp 3.0i
.ce 2
\fBFigure 4.1: Principal results for spin-waiting: elapsed time
to increment a shared counter to 1,000,000 (measured)\fR
.sp 3.0i
.ce
\fBFigure 4.2: Relative processor speed (8 processors to 1 processor) vs. critical section size (measured)\fR
.sp .5
.KE
.PP
Figure 4.2 shows the effect of increasing the size of the
critical section on each algorithm's performance. In addition to
incrementing a counter, the critical section contains varying
amounts of other work. We normalized the time for eight processors
by the time for one processor.
This measures relative processor speed.
Again, if spin-waiting did not slow the
processor holding the lock, one processor would not be faster than eight,
and the relative processor speed would always be equal to 1.
As expected, spinning on a read degrades performance less as
the size of the critical section grows, while spinning on the xchgb
instruction degrades performance evenly throughout the critical section.
.PP
To test the tradeoff between processor degradation and the delay
in acquiring a newly released lock, we measured the elapsed time
for a number of processors to each increment a shared counter within
a critical section. Once a processor acquired the lock and bumped
the counter once, it was set to loop until all processors were done.
This test is indicative of the cost of using a lock for barrier synchronization.
Figure 4.3 shows the elapsed time divided by the number of processors.
If there is no processor degradation or delay in acquiring the lock,
the elapsed time to achieve the barrier should increase linearly
with each additional processor; the normalized curve in Figure 4.3 should
be flat.
.KF
.sp 3.0i
.ce
\fBFigure 4.3: Normalized time (\(*msec. per processor) to achieve barrier (measured)\fR
.sp .5
.KE
.PP
Figure 4.3 shows that for small numbers of processors,
spinning on the xchgb instruction is fastest, since a processor
immediately acquires the lock when it is released. As more processors
are added, however, even though the lock is acquired faster, this
is outweighed by the degradation of the processor holding the lock.
The backoff algorithm shows a similar curve to spinning on a read,
but for a different reason. Initially, many processors are queued for
the lock; this leads spinning processors to guess large delay times.
As more processors acquire the lock, there are fewer queued processors,
and the delays, tuned to the earlier contention, become too long.
.PP
Processors doing work are slowed proportional to the number of times
they access the bus. Thus, the results of these tests depend somewhat
on the content of the critical section. However, since the purpose of
a critical section is to serialize modifications to shared data, its
code is likely to be bus intensive.
Our measurements indicate that almost half of the bus service demand
of thread management is due to the critical section.
Further, thread management critical sections also tend to be small.
For example, enqueueing and dequeueing a ready thread each take less than
10 microseconds inside the critical section, roughly the same as the critical section measured in Figure 4.1.
.NH 2
Implications for other systems
.PP
In this section, we show that the performance of spin-waiting is of
concern on architectures other than the Symmetry Model A.
.PP
Some multiprocessors do not provide hardware cache coherency;
the BBN Butterfly [BBN 1985] is an example.
For these systems, every test of a lock value by a spinning
processor requires a memory access. By inserting a delay between each test,
the effect of spinning on busy processors can be reduced;
backoff can be used to adapt the frequency of reads to the number of
waiting processors.
.PP
The Symmetry Model A has a write-through protocol: when
a processor modifies a location, the value is written to memory and
all old copies of the location in other caches are invalidated.
There is a cost to spin-waiting, even in architectures with a write-back
cache coherency protocol.
In a write-back protocol, the value is stored in the cache and later
written to memory when the cache block is replaced. There are two
major approaches to keeping other caches consistent with the new value: all
old copies in other caches can either be invalidated or updated
with the new value (distributed-write) [Archibald & Baer 1986].
.PP
In the case of an invalidation-based write-back protocol, the spin-waiting
alternatives have much the same effect as with write-through. If
processors spin on the atomic test-and-set operation, the valid copy
of the lock bounces from cache to cache, consuming bus resources.
Provided test-and-sets invalidate all cache copies whether or not
the lock value changes, spinning on a read does not help;
there is still a cascade of repeated invalidations when the lock is
released.
.PP
One possibility, then, is to add hardware to compare the old and
new value of the lock on a test-and-set and to invalidate other copies
only if they differ. While this would improve the performance of
spinning on a read, it does not eliminate the problem.
With $P$ spinning processors, there are $O(P)$ bus requests
per lock acquisition. Each processor must cache miss when the
lock is released; it must also acquire the bus to ensure the
atomicity of its subsequent test-and-set.
.PP
Systems with distributed-write coherency have similar performance.
When a processor performs an atomic operation, every cache with an
old copy is updated with the new value; thus no cache misses occur.
If processors spin on a read, however, there will still be a
rush of processors to try the lock when it is first released.
Since the backoff algorithm reduces the number of unsuccessful lock
attempts, it would reduce the bus load due to spinning even further.
.PP
Explicitly queueing spinning processors can further improve performance. Each
processor in the queue spins on a separate flag; when a processor
finishes with the lock, it passes control of the lock by setting the
flag of the next one in the queue, without invalidating the flags
of the other waiting processors.
.PP
In a related paper, we devised an efficient queue-based spin-waiting
algorithm that uses only $O(1)$ bus transactions per execution of the
critical section [Anderson 1989].
Each arriving processor does an atomic read-and-increment
to obtain a unique sequence number. When a processor finishes with
the lock, it taps the processor with the next highest sequence number;
that processor now owns the lock.
Since processors are sequenced, no atomic
read-then-write instruction is needed to pass control of the lock.
Table 4.1 lists the code for this approach ("myPlace" is a location private
to each processor). Measurements of this algorithm appear in that paper.
.KF
.sp 0.5
.ft C
.ps -1
.vs -1
.TS
center, box;
l | l.
Init flags[0] = HAS_LOCK;
flags[1..P-1] = MUST_WAIT;
queueLast = 0;
_
Lock myPlace = ReadAndIncrement(queueLast);
while (flags[myPlace mod P] == MUST_WAIT)
;
flags[myPlace mod P] = MUST_WAIT;
_
Unlock flags[(myPlace + 1) mod P] = HAS_LOCK;
.TE
.ps +1
.vs +1
.ft P
.ce
\fBTable 4.1: Queue-based algorithm for spin-waiting\fR
.sp 0.5
.KE
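.PP
For concreteness, a C rendering of Table 4.1 is sketched below. It is an
illustration rather than the implementation measured in [Anderson 1989]:
C11 atomics stand in for the machine's atomic read-and-increment, the
number of processors is assumed fixed at compile time, and the routine
names are ours.
.LS
#include <stdatomic.h>

#define NPROC 20                         /* number of processors (assumed) */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

static _Atomic int flags[NPROC];
static atomic_uint queueLast;

void qlock_init(void)
{
    int i;
    for (i = 1; i < NPROC; i++)
        atomic_store(&flags[i], MUST_WAIT);
    atomic_store(&flags[0], HAS_LOCK);   /* the first arrival gets the lock */
    atomic_store(&queueLast, 0);
}

unsigned qlock_acquire(void)
{
    /* atomic read-and-increment yields a unique sequence number */
    unsigned myPlace = atomic_fetch_add(&queueLast, 1);
    while (atomic_load(&flags[myPlace % NPROC]) == MUST_WAIT)
        ;                                /* spin on a private flag */
    atomic_store(&flags[myPlace % NPROC], MUST_WAIT);   /* reset for reuse */
    return myPlace;
}

void qlock_release(unsigned myPlace)
{
    /* tap the next processor in sequence; no atomic
       read-then-write is needed to pass the lock */
    atomic_store(&flags[(myPlace + 1) % NPROC], HAS_LOCK);
}
.LE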
.NH
Analytical Results
.PP
We developed a simple queueing network model for our thread package
to demonstrate that the combination of processor degradation due
to bus contention and the effect of lock contention can
account for our measurements. We then used the validated model to
project the performance of our thread package under varying conditions.
.PP
Our model is hierarchical. The low level model represents the
effect of bus contention on processor speed. The high level
model represents the effect of lock contention
on throughput and response time. Since processor speed affects
the amount
of lock contention and the number of spinning processors affects
bus contention and thus processor speed, we iterate between levels
to convergence. We describe the two sub-models in more detail below.
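.PP
The shape of that iteration is sketched below. The two solver routines
shown are placeholders for the models of the following two subsections;
the routine names and the relationships inside them are ours, chosen only
to make the skeleton self-contained. The point is simply that the two
levels feed each other until the slowdown and the spin-waiting fraction
stop changing.
.LS
#include <math.h>

/* Placeholder low level model: map the fraction of time spent
   spin-waiting to a processor slowdown factor (at least 1.0). */
static double bus_model(double spin_fraction)
{
    return 1.0 + 0.5 * spin_fraction;        /* illustrative only */
}

/* Placeholder high level model: map the slowdown of critical
   sections to the fraction of time spent waiting for locks. */
static double lock_model(double slowdown)
{
    return 0.05 + 0.1 * (slowdown - 1.0);    /* illustrative only */
}

/* Iterate between the two levels until both quantities converge. */
void solve_hierarchical_model(double *slowdown, double *spin_fraction)
{
    double s = 1.0, f = 0.0;
    int i;
    for (i = 0; i < 100; i++) {
        double new_s = bus_model(f);
        double new_f = lock_model(new_s);
        if (fabs(new_s - s) < 1e-6 && fabs(new_f - f) < 1e-6)
            break;                           /* converged */
        s = new_s;
        f = new_f;
    }
    *slowdown = s;
    *spin_fraction = f;
}
.LE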
.NH 2
Modelling bus contention
.PP
In the low level model, we represent each processor as a customer
in a closed queueing network. The network has two service centers: a
queueing center for the bus and a delay center for non-bus activity.
Each processor spends some of its time referencing memory through the
bus and thus contending with other processors also using the bus,
and some of its time processing out of its cache, independent of the
activity of other processors. Processor speed is degraded by the
percentage of time spent queueing, but not in service, at the bus.
.KF
.sp 2.0i
.ce
\fBDiagram 5.1: Low level model of bus contention\fR
.sp .5
.KE
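.PP
A closed network with one queueing center and one delay center can be
solved exactly by mean value analysis. The sketch below is an illustration
of that calculation (the routine name is ours, and its arguments are
placeholders rather than measured demands); it returns the degradation
factor defined above, the fraction of each processor's time spent queued,
but not in service, at the bus.
.LS
/* Exact single-class MVA for the two-center model of Diagram 5.1. */
double bus_degradation(int nproc, double d_bus, double d_cpu)
{
    double q_bus = 0.0;                  /* mean number queued at the bus */
    double x = 0.0, r_bus = 0.0;
    int n;
    for (n = 1; n <= nproc; n++) {
        r_bus = d_bus * (1.0 + q_bus);   /* residence time at the bus */
        x = n / (r_bus + d_cpu);         /* throughput; d_cpu is pure delay */
        q_bus = x * r_bus;               /* Little's law applied at the bus */
    }
    /* queueing (not service) time at the bus, as a fraction of all time */
    return x * (r_bus - d_bus) / nproc;
}
.LE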
.PP
This model is an approximation of the real bus mechanism, which is
considerably more complex [Lovett & Thakkar 1988].
At moderate loads, our model will be pessimistic by predicting more
contention than is actually experienced. Because of the regularity
of the time each processor spends computing between accesses to the bus,
if two processors collide at the bus, they are unlikely to collide
at their next visit. Our model assumes that arrivals are more nearly
independent.
.PP
There are three components to bus utilization. A processor can
be executing user code, thread management code, or spin-waiting,
each with different service demands on the bus. Given these
service demands and the ratio of time each processor spends in each
type of activity, we determine the aggregate service demands
at the bus and at the delay center and use these aggregate demands
to solve the model.
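.PP
Concretely, if a processor divides its time among user code, thread
management, and spin-waiting in fractions $f sub user$, $f sub tm$, and
$f sub spin$, and these activities place bus demands of $d sub user$,
$d sub tm$, and $d sub spin$ per unit of time, then the aggregate bus
demand per unit of processor time is the weighted sum
$f sub user d sub user ~+~ f sub tm d sub tm ~+~ f sub spin d sub spin$
(the symbols here are ours, for illustration only).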
.PP
Since it is difficult to analytically determine the bus demand of a section
of code, we determine a portion of it inductively from measurements.
We provide each processor with its own copy of all data structures;
we then run the code in parallel on each processor. Since there is no
shared data, there can be no contention for software resources; any
delay experienced by a processor relative to when it is running the
code by itself must be due to contention for hardware resources, such as
for memory or the bus. We then match a curve from our model of the bus
to the measured
curve and use the result as the service demand for that section of code.
The curves matched well in practice, deviating only at moderate loads,
as expected.
.PP
Since bus contention may disproportionately impact the critical section
execution time, affecting lock contention in the high level model, we
used this approach separately for the critical section and
non-critical section code within thread management. The critical
section code turns out to account for much of the bus demand of thread
management.
.PP
Even though it could affect bus usage, we did not include in our model
the effect of different numbers of processors on cache hit ratios.
When a processor writes a location, the Symmetry updates both
memory and that processor's cache.
As a result, on a single processor, data that is both written and
read will tend to stay in the cache, avoiding cache misses.
When multiple processors read and write shared data, the cache copies
of the data will be repeatedly invalidated as different processors
update it, resulting in more cache misses than in the single processor
case. Our model therefore underestimates bus demand, making it
optimistic, especially as the bus nears saturation.
.PP
The bus demand of spinning processors
was also determined inductively. $P$ processors were set to run
the critical section with separate copies of the data structures;
by the experiment described above, we know the bus service demand
of these processors. $Q$ processors were set to run a shared copy
of the critical section; one of these processors has the normal
bus service demand, and $Q~-~1$ spin-wait. By measuring the
processor degradation of the $P$ copies, we can determine the
aggregate bus demand of the $Q~-~1$ spinning processors. A two
class model was used, one class representing
processors executing critical sections and one representing
spinning processors. Only the response time of the processors
executing the critical section is important.
.PP
The bus demand, at least for the backoff algorithm, is linear
with the number of processors. While there is no \fIa priori\fR reason
for this, it intuitively makes sense. The effect of adding a spinning
processor with the backoff algorithm is to add two cache misses per
execution of the critical section. The bus demand of other processors
is relatively unaffected. While this invariance would also hold
for the spin on xchgb algorithm, it is less true when processors spin
on memory reads, because the cascade of cache misses is longer for every
processor when more processors are spinning. Note that the graphs in
Section 4 could be used to infer the bus demand of spinning processors.
We did not choose this approach because the bus accesses of the
processor holding the lock are correlated with those of the
processors spinning on the lock.
The curve for the backoff algorithm in Figure 4.1, for example,
is similar to that of an optimistic asymptotic bound.
.NH 2
Modelling lock contention
.PP
In the high level model, we represent each lock in the thread management
path by a separate queueing center. Processing time spent not holding
a lock is modelled as a delay center. Service demands were directly
measured, then the part of each service demand due to bus accesses
was inflated by the bus response time of the low level model.
As in the low level model, each processor is represented as a single
customer in a closed class.
By solving this model, we can determine the average amount of time
each processor spends spin-waiting for a lock versus executing thread
operations or user code. This ratio is then used as an input to
the low level model.
(Note that it is a simple matter in this model
to add queueing centers if the application-level code does further locking.)
.KF
.sp 2.0i
.ce
\fBDiagram 5.2: High level model of lock contention for the local freelist alternative\fR
.sp .5
.KE
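.PP
The same mean value analysis recursion, extended to one queueing center
per lock, solves this model. The sketch below is again illustrative rather
than the implementation behind our results: the inflation of the bus-bound
part of each demand is simplified here to scaling the whole demand by the
slowdown factor from the low level model, and the names are ours. The
value returned is the fraction of time spent spin-waiting for locks, the
quantity fed back to the low level model.
.LS
/* Single-class MVA with one queueing center per lock plus a delay
   center; assumes at most 32 locks.  An illustration of the high
   level model only. */
double spin_wait_fraction(int nproc, int nlocks,
                          double d_lock[],  /* demand holding each lock */
                          double d_delay,   /* demand not holding any lock */
                          double slowdown)  /* from the low level model */
{
    double q[32], r[32];
    double x = 0.0;
    int n, k;
    for (k = 0; k < nlocks; k++)
        q[k] = 0.0;
    for (n = 1; n <= nproc; n++) {
        double rtotal = d_delay * slowdown;
        for (k = 0; k < nlocks; k++) {
            r[k] = d_lock[k] * slowdown * (1.0 + q[k]);
            rtotal = rtotal + r[k];
        }
        x = n / rtotal;
        for (k = 0; k < nlocks; k++)
            q[k] = x * r[k];
    }
    /* time spent waiting for, but not holding, locks as a fraction
       of all time */
    double waiting = 0.0;
    for (k = 0; k < nlocks; k++)
        waiting = waiting + (r[k] - d_lock[k] * slowdown);
    return x * waiting / nproc;
}
.LE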
.PP
If the time between thread operations is deterministic, our model
is pessimistic at moderate loads. As for the bus, if two processors
collide at a lock, the effect of deterministic processing times is to
reduce the likelihood that they will collide at the next visit.
Figure 3.2 shows this effect. The curves are similar in shape to
asymptotic optimistic bounds, since the processing time to do each
thread operation is deterministic. Figure 3.3 does not show this effect,
since the user computation for each thread was randomly chosen from a
uniform distribution.
.PP
Our model does not explicitly represent an application's distribution
of parallelism, although Figure 3.4 shows that this affects performance.
We chose not to include this in our model since the distribution,
and more importantly the effect of lock queueing delay on that
distribution, are almost always application-dependent.
.PP
Given the distribution, the model could be evaluated separately for each
population of threads; these separate evaluations could then be
averaged, weighted by the proportion of time for that population.
The population of the high level model should be the minimum of the
number of processors and the number of threads, reflecting the number
of active processors. The population of the low level model should
be set similarly, except that since idle processors consume bus resources,
a second class should be added to represent them.
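.PP
In equation form (notation ours, for illustration), if the application runs
with $i$ ready threads for a fraction $w sub i$ of its execution, the
predicted speedup is $sum from i w sub i S sub i$, where $S sub i$ is the
speedup given by the model evaluated with population $min (i,~P)$.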
.PP
This method of separate evaluations ignores the fact that lock
contention can only occur when the parallelism is being incremented or
decremented; we believe that any distortion introduced by the adaptive
nature of the mechanism will be outweighed by the effects of lock and
bus contention. Ni and Wu [1985] also discuss this issue.
.NH 2
Comparison with measured results, and projections
.PP
Figure 5.1 compares our model results with our measurement results
previously reported in Figure 3.3. We modelled two
alternatives: per-processor ready queues (local readyq) and
per-processor free lists with a central ready queue (local freelist).
Our model agrees well with the measurements, within 5% except
for the central ready queue with 18 processors. The model predicts
the shape of the curve, but is somewhat optimistic; this appears to
be due to underestimating the bus demand, which is important in
determining the effective size of the critical section. The model
does capture the difference between the alternatives.
.PP
Having validated our model, we used it to investigate the effect
of varying key parameters. Figure 5.2 shows speedup of a hypothetical
application with 20 processors
as a function of the amount of user computation per thread. As we
would expect, as an application uses finer-grained
parallelism (smaller amounts of computation per thread), the central
lock on the ready queue becomes a bottleneck.
For sufficiently
coarse-grained parallelism, the performance of the thread package ceases
to matter. In the limit, even DYNIX processes could be used.
.KF
.sp 3.0i
.ce
\fBFigure 5.1: Comparison of analytic and measured results from Figure 3.3\fR
.KE
.KF
.sp 3.0i
.ce 2
\fBFigure 5.2: Speedup vs. \(*msec. of user computation
per thread, 20 processors, bus load = 5% (analytic)\fR
.sp .5
.KE
.KF
.sp 3.0i
.ce
\fBFigure 5.3: Speedup vs. bus load, user work = 200 \(*msec., 20 processors (analytic)\fR
.sp .5
.KE
.PP
Contention for the bus can also reduce the difference between the
alternatives. Figure 5.3 shows speedup as a function of the
percentage usage of the bus by each thread. As the bus usage increases,
the bus limits the speedup with local ready queues, but
it also limits the speedup with the central ready queue, since
bus contention inflates the critical section time.
.PP
On the other hand, the central ready queue lock can again limit speedup
even for more coarsely-grained parallelism, given a sufficient
number of processors. Figure 5.4 shows speedup as a function
of the number of processors when threads each compute for 2 milliseconds.
The sharp dropoff for the central ready queue alternative shows the
inherent instability of a system where spinning processors consume resources.
.KF
.sp 3.0i
.ce
\fBFigure 5.4: Speedup vs. number of processors, user work = 2 msec. (analytic)\fR
.sp
.KE
.bp
.NH
Conclusions
.PP
Threads have become a common element of new languages and operating systems.
Efficient thread management is critical to achieving good performance
from parallel applications. We have studied the
performance
implications of several thread management and locking alternatives.
We showed that:
.IP \(bu
It is possible to implement a fast thread package. Simplicity is
crucial for this.
.IP \(bu
For fine-grained parallelism, small changes in data structures
and locking have a significant effect on both latency and throughput.
.IP \(bu
Per-processor data structures can be used to improve throughput;
if a resource is not scarce, localizing data can avoid locking,
improving latency as well.
.IP \(bu
Spin-waiting can delay not only the processor waiting for a lock,
but other processors doing work. This appears to be independent
of the cache coherency protocol.
.IP \(bu
The cost of spin-waiting can be reduced by using an
Ethernet-style backoff or a queue-based algorithm.
.IP \(bu
A simple queueing model can accurately predict the effect of a combination
of factors on the performance of shared-memory multiprocessors.
.PP
An area of future research is to determine the extent to which our results,
developed in the context of thread management systems,
also apply to application programs that exploit fine-grained parallelism
on shared-memory multiprocessors.
.SH
Acknowledgements
.PP
We would like to thank Dave Wagner for suggesting that an Ethernet-style
algorithm might solve the spin-waiting problem. A preliminary version of
this paper appeared in the Proceedings
of the 1989 ACM SIGMETRICS and Performance '89 International
Conference on Measurement and Modeling of Computer Systems,
\fIPerformance Evaluation Review\fR 17, 1 (May 1989),
Copyright 1989, Association for Computing Machinery, Inc.,
reprinted by permission.
.SH
References
.nh
.ps 11
.nr PS 11
.vs 12
.nr VS 12
.LP
[Accetta et al. 1986]
.RS
M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian,
and M. Young.
Mach: A New Kernel Foundation For UNIX Development.
\fIProc. Summer 1986 USENIX Technical Conference and Exhibition\fR,
June 1986, pp. 93-112.
.RE
.LP
[Agarwal & Cherian 1989]
.RS
A. Agarwal and M. Cherian.
Adaptive Backoff Synchronization Techniques.
\fIProc. 16th International Symposium on Computer Architecture\fR,
June 1989, pp. 396-406.
.RE
.LP
[Anderson 1989]
.RS
Thomas E. Anderson. The Performance Implications of Spin-Waiting Alternatives
for Shared-Memory Multiprocessors.
\fIProc. 1989 International Conference on Parallel Processing\fR, Aug. 1989.
.RE
.LP
[Archibald & Baer 1986]
.RS
J. Archibald and J.-L. Baer.
Cache Coherence Protocols: Evaluation Using a Multiprocessor
Simulation Model.
\fIACM Transactions on Computer Systems\fR, vol. 4, no. 4, Nov. 1986,
pp. 273-298.
.RE
.LP
[Bach & Buroff 1984]
.RS
M. J. Bach and S. J. Buroff.
Multiprocessor UNIX Operating Systems.
\fIAT&T Bell Laboratories Technical Journal\fR,
vol. 63, no. 8, Oct. 1984, pp. 1733-1749.
.RE
.LP
[BBN 1985]
.RS
BBN Laboratories. Butterfly Parallel Processor Overview. 1985.
.RE
.LP
[Bershad et al. 1988a]
.RS
Brian Bershad, Edward Lazowska, and Henry Levy.
Presto: A System for Object-Oriented Parallel Programming.
\fISoftware: Practice
and Experience\fR, vol. 18, no. 8, Aug. 1988, pp. 713-732.
.RE
.LP
[Bershad et al. 1988b]
.RS
Brian Bershad, Edward Lazowska, Henry Levy, and David Wagner.
An Open Environment for Building Parallel Programming Systems.
\fIProc. ACM/SIGPLAN PPEALS 1988\fR, pp. 1-9.
.RE
.LP
[Dritz & Boyle 1987]
.RS
Kenneth W. Dritz and James M. Boyle.
Beyond "Speedup": Performance Analysis
of Parallel Programs.
Technical Report ANL-87-7, Mathematics and
Computer Science Division, Argonne National Laboratory,
Feb. 1987.
.RE
.LP
[Eager et al. 1986]
.RS
Derek Eager, Edward Lazowska, and John Zahorjan.
Adaptive Load Sharing in Homogeneous Distributed Systems.
\fIIEEE Transactions on Software Engineering\fR,
vol. 12, no. 5, May 1986, pp. 662-675.
.RE
.LP
[Edler et al. 1988]
.RS
Jan Edler, Jim Lipkis, and Edith Schonberg.
Process Management for Highly Parallel UNIX Systems.
Ultracomputer Note #136, April 1988.
.RE
.LP
[Hoare 1978]
.RS
C.A.R. Hoare.
Communicating Sequential Processes.
\fICommunications of the ACM\fR, vol. 21, no. 8, Aug. 1978, pp. 666-677.
.RE
.LP
[Holt 1982]
.RS
R. Holt.
A Short Introduction to Concurrent Euclid.
\fISIGPLAN Notices\fR, vol. 17, May 1982, pp. 60-79.
.RE
.LP
[Jul et al. 1988]
.RS
Eric Jul, Henry Levy, Norman Hutchinson, and Andrew Black.
Fine-Grained Mobility in the Emerald System.
\fIACM Transactions on Computer Systems\fR, vol. 6, no. 1, Feb. 1988,
pp. 109-133.
.RE
.LP
[Kumar & Gonsalves 1977]
.RS
B. Kumar and Timothy Gonsalves.
Modelling and Analysis of Distributed Software Systems.
\fIProc. 7th ACM Symposium on Operating Systems Principles\fR,
Dec. 1977, pp. 2-8.
.RE
.LP
[Lampson & Redell 1980]
.RS
B.W. Lampson and D.D. Redell.
Experiences with Processes and Monitors in Mesa.
\fICommunications of the ACM\fR, vol. 23, no. 2, Feb. 1980, pp. 104-117.
.RE
.LP
[Lovett & Thakkar 1988]
.RS
Tom Lovett and Shreekant Thakkar.
The Symmetry Multiprocessor System.
\fIProc. 1988 International Conference on Parallel Processing\fR,
pp. 303-310.
.RE
.LP
[Metcalfe & Boggs 1976]
.RS
Robert Metcalfe and David Boggs.
Ethernet: Distributed Packet Switching for Local Computer Networks.
\fICommunications of the ACM\fR, vol. 19, no. 7, July 1976, pp. 395-404.
.RE
.LP
[Mundie & Fisher 1985]
.RS
D.A. Mundie and D.A. Fisher.
Parallel Processing in Ada.
\fIIEEE Computer\fR, Aug. 1985, pp. 20-25.
.RE
.LP
[Ni & Wu 1985]
.RS
Lionel Ni and Ching-Fern Wu.
Design Trade-offs for Process Scheduling in Tightly Coupled Multiprocessor
Systems.
\fIIEEE Transactions on Software Engineering\fR, vol. 15, no. 3, March 1989,
pp. 327-334.
.RE
.LP
[Scott et al. 1988]
.RS
Michael Scott, Thomas LeBlanc, and Brian Marsh.
Design Rationale for
Psyche, a General Purpose Multiprocessor Operating System.
\fIProc. 1988 International Conference on Parallel Processing\fR,
Aug. 1988.
.RE
.LP
[Sequent 1988]
.RS
Sequent Computer Systems, Inc.
Symmetry Technical Summary.
.RE
.LP
[Thacker et al. 1988]
.RS
Charles Thacker, Lawrence Stewart, and Edward Satterthwaite Jr.
Firefly: A Multiprocessor Workstation.
\fIIEEE Transactions on Computers\fR, vol. 37, no. 8, Aug. 1988, pp. 909-920.
.RE
.LP
[Vandevoorde & Roberts 1988]
.RS
Mark Vandevoorde and Eric Roberts.
WorkCrews: An Abstraction for Controlling Parallelism.
\fIInternational Journal of Parallel Programming\fR, vol. 17, no. 4,
Aug. 1988, pp. 347-366.
.RE
.LP
[Wagner et al. 1989]
.RS
David Wagner, Edward Lazowska, and Brian Bershad.
Techniques for Efficient Shared-Memory Parallel Simulation.
\fIDistributed Simulation 1989\fR, Society for Computer Simulation,
March 1989, pp. 29-37.
.RE
.LP
[Zahorjan et al. 1988]
.RS
John Zahorjan, Edward Lazowska, and Derek Eager.
Spinning Versus Blocking in Parallel Systems with Uncertainty.
\fIProc. International Seminar on the Performance of
Distributed and Parallel Systems\fR, North Holland, Dec. 1988.
.RE